GPUs have been widely used to accelerate computations exhibiting simple patterns of parallelism - such as flat or two-level parallelism - and a degree of parallelism that can be statically determined based on the size of the input dataset. However, the effective use of GPUs for algorithms exhibiting complex patterns of parallelism, possibly known only at runtime, is still an open problem. Recently, Nvidia has introduced Dynamic Parallelism (DP) in its GPUs. By making it possible to launch kernels directly from GPU threads, this feature enables nested parallelism at runtime. However, the effective use of DP must still be understood: a naive use of this feature may suffer from significant runtime overhead and lead to GPU underutilization, resulting in poor performance. In this work, we target this problem. First, we demonstrate how a naive use of DP can result in poor performance. Second, we propose three workload consolidation schemes to improve performance and hardware utilization of DP-based codes, and we implement these code transformations in a directive-based compiler. Finally, we evaluate our framework on two categories of applications: algorithms including irregular loops and algorithms exhibiting parallel recursion. Our experiments show that our approach significantly reduces runtime overhead and improves GPU utilization, leading to speedup factors from 90x to 3300x over basic DP-based solutions and speedups from 2x to 6x over flat implementations.
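To make the "naive use of DP" concrete, the following is a minimal CUDA sketch (not code from the paper; kernel and parameter names are illustrative) in which each parent thread launches its own child grid for a nested amount of work known only at runtime. It requires a device of compute capability 3.5 or higher and compilation with relocatable device code (e.g. `nvcc -rdc=true`):

```cuda
// Child kernel: per-element work over one thread's nested work segment.
__global__ void child(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: naive Dynamic Parallelism. Each parent thread launches a
// separate child grid sized by its own runtime-dependent work count. Many
// such tiny launches incur per-launch overhead and often underutilize the
// GPU -- the behavior the proposed consolidation schemes aim to avoid by
// merging child work into fewer, larger grids.
__global__ void parent(float *data, const int *offsets, const int *counts) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int n = counts[t];  // degree of nested parallelism, known only at runtime
    if (n > 0)
        child<<<(n + 255) / 256, 256>>>(data + offsets[t], n);
}
```

Here `offsets` and `counts` are assumed to describe each thread's irregular work segment (as in an irregular nested loop); consolidation would instead aggregate these segments and launch one child grid covering all of them.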